The ZIFCO project studies many facets of respiratory infections in Germany. Participants of the German National Cohort (Nationale Kohorte, NAKO) were asked a series of questions via the app PIA, such as whether they had certain symptoms. In parallel, blood and swab samples were taken and analyzed by different labs.
Here we process the raw data sets generated by ZIFCO and produce two series of cleaned data sets. The first is meant for further analyses, in a format convenient for processing in scientific programming languages such as R or Python. The second series, derived from the first, fulfills all criteria for an upload to NAKO’s central database and is accompanied by a meta-data table, as required.
The data processing and export are performed with a suite of custom R scripts. Where applicable, the processing steps have been programmed in a general and flexible manner. Other steps are specific to individual data sets, sometimes to individual data entries.
In this report, data sets are referred to by the names used during processing. These are defined in the configuration file; see section One central configuration file for a preview of the configuration and the link between data-set names and data-set files. For example, the name nasal_swabs_pcr refers to the data set contained in the file lab_results.csv.
Two data sets are exports from the PIA questionnaires: answers is the direct export of 15 December 2022, answers_backup a database export. The latter is used because some questionnaires had been inadvertently deleted and thus must be added back to the normal export.
pia_codebook is the code book of PIA: it contains the list of questionnaires, questions, follow-up questions, and possible answers, as well as meta data on these variables.
consent lists which consents each participant gave; it also flags entries that are in fact test entries and should be removed during processing.
cpt_hub and cpt_pia refer to blood samples as recorded by HUB and PIA, respectively. They are merged into a single data set cpt during processing.
examination contains examination dates for a data freeze
of 26 July 2022.
nasal_swabs_pcr contains the participant pseudonyms, sample ID’s, and PCR results of tests against a variety of respiratory viruses.
pbmc, plasma and swabs are the
records of the corresponding samples at HUB. pbmc is
obtained by merging the two raw data sets pbmc_1 and
pbmc_2 during processing.
samples is a lookup table for matching sample ID’s and
participant pseudonyms.
During processing, three further tables are generated for matching samples and participants: ids_lookup, ids_lookup_1, ids_lookup_2; see below for details.
Note that currently only nasal_swabs_pcr contains
results of sample analyses.
The tabs below show previews of all raw and processed data sets. The data formatted for NAKO as well as the corresponding meta-data are shown as well. For each data set, beside the code book and the NAKO meta data, a random sample of 1000 entries has been chosen; different entries are shown before and after processing.
Note that dates and date-times are encoded as numbers in Excel and are only displayed as proper dates within Excel itself. For example, July 5, 2011 is encoded as 40729. We import data as close to raw as possible, so dates from Excel are imported as strings of digits.
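As an illustration, such a serial number can be decoded as follows (a Python sketch assuming the Windows Excel epoch; the pipeline itself does the equivalent in R):

```python
from datetime import date, timedelta

# Excel for Windows counts days from an epoch of 1899-12-30
# (the two-day shift absorbs Excel's fictitious 1900-02-29).
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Decode a date that was imported from Excel as a string of digits."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

print(excel_serial_to_date("40729"))  # 2011-07-05
```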
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
Note that cells with missing values are shown as empty here but contain null in the exported CSV files.
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
Many of the options which users might want to change are listed in the configuration file config.yml. This is a text file that can be opened and edited with any text editor. (Clicking on the link opens it in the browser but doesn’t allow one to edit it.) On Windows, Notepad++ offers convenient functions, but Wordpad, for example, can be used as well.
The extension .yml indicates it is structured as a YAML file. It is similar to JSON, but more flexible. It allows the rapid definition and editing of structured, nested data. The main rules are the following:
All directory paths should be written with a slash / separating directories instead of the Windows standard \.
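To illustrate the syntax, here is a minimal, hypothetical fragment in the style of config.yml (the keys mirror options documented below; all values are invented):

```yaml
# Hypothetical excerpt illustrating the YAML syntax, not the actual contents
pipeline_ouput:
  save: true                      # nesting is expressed by indentation
  file: "output/pipeline.rds"     # forward slashes, also on Windows
convert_time_from_germany: true
remove_variables: [ids, comment]  # a set of strings can be written inline
```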
Please see the accompanying file config.yml. The resulting information can be visualized as a nested list. Below is a visualization of the configuration used to process the data (it doesn’t correspond to the syntax used in the file):
One notable feature is the possibility of mapping many variables
across data sets to one variable which will have the same properties
everywhere. For example, in the configuration above, all three variables
Proband, user_id, Pseudonym PIA
are mapped to participant_id which will have the same title
and description in all data sets (expand as follows: object >
variables > participant_id > original).
The configuration contains the following elements:
- pipeline_ouput: what to do with all the variables and data generated when running the processing pipeline
  - save: true or false, whether to save the output
  - file: string, path to the file where the output is saved
- convert_time_from_germany: true or false, whether to keep the date-times as they are in the data set, assuming UTC time, or to assume they were measured in Germany and convert them to UTC, accounting for winter and summer time (as of 2022)
- match_via_hub: true or false, whether to use the participant pseudonyms provided by HUB as intermediate keys to match final participant pseudonyms and sample ID’s
- remove_variables: a set of strings, variables to remove from the data sets during processing
- remove_entries_with_values: a list of variables, each associated with a set of values for which any corresponding entry in any data set is removed during processing
- remove_entries_ambiguous_rna_samples: true or false, whether to remove RNA swab samples (identified by sample ID’s “zifco-11” followed by 8 digits) with which no participant can be associated unambiguously
- nako: a list of configurations specific to the export for the NAKO
  - application_number: number or string, the NAKO application number, used to name variables as well as the export folder
  - keep_datasets: set of strings, names of data sets to export
  - remove_vars: set of strings, variables to remove from the export
  - guess_scale_level: true or false, whether the program should guess the scale level (metric, ordinal, nominal, …) of a variable based on its type (integer, character, date, …)
  - missing_codings: string, how missing values are to be encoded; currently only the value "-1" is accepted
  - missing_options: string, what to print in the NAKO meta-data as option (fixed set of possible values that a variable can take); currently only the value "-1 = Fehlende Rohdaten" is accepted
  - export_path: string, path to the directory where the NAKO files should be written
- dataset_types: list of broad categories of data sets, currently used for organizing the NAKO export; each type has:
  - title: string, title of the data-set type; used for NAKO export
  - description: string, description of the data-set type; used for NAKO export
- datasets: a list of the data sets to be imported, processed, and/or exported
  - all_data_sets: a list of configurations that apply to all data sets; each can be omitted or set to null and defined for each data set individually instead
    - raw_dir: string, directory where to find the raw-data-set files
    - save_native: true or false, whether to save the imported raw data sets in the R-native format RDS
    - read_native: true or false, whether to import raw data sets in the R-native format RDS instead of the original files (can speed up the import significantly)
    - native_dir: string, path to the directory where the raw data in RDS format should be saved or read from
    - save_processed: true or false, whether to save the processed data sets (automatically in the R-native format RDS)
    - processed_dir: string, path to the directory where the processed data sets are saved
  - raw_data_file: string, path to the original file or folder containing the raw data set; appended to raw_dir defined above
  - raw_data_native_format_file: string, path to the native (RDS) file containing the raw data set; appended to native_dir defined above
  - dataset_type: string, the data-set type this data set belongs to; this should be the same name as appears in the dataset_types list above
  - title: string, title of the data set; used for NAKO export
  - description: string, description of the data set; used for NAKO export
  - variables: a list of variables for which specific configurations apply; other variables present in the data sets but omitted here are processed in a default manner; each variable is a list of the following properties (each can be omitted or set to null, beside original):
    - original: set of strings, variable names, across data sets, to be mapped to this one variable
    - type: string, either boolean, character, date, date_time, integer, or float; if omitted or null, the variable is treated as character
    - unit: string, the unit of the quantity stored in the variable; used for NAKO export
    - scale_level: string, “scale level” (Skalenniveau) of the variable; can be either “metrisch”, “nominal”, “ordinal”, “hierarchisch” or “Text”; used for NAKO export
    - dataset: string, the name, as listed above, of the data set to which the original variable belongs; this is relevant when two variables in two data sets have the same name but should be mapped to different variables

The data-processing pipeline consists of six main steps:
Furthermore, various checks and statistics are computed to be used in this report.
All this is executed from the R script pipeline.R, which itself calls all necessary functions, defined in the various scripts in the R folder.
Depending on the configuration, the raw data sets are imported either from the original files or from files in the R-native RDS format. The latter are obtained by importing from the original file once in R and saving from R in the RDS format.
The original files can be either RDS, JSON, Excel, or CSV files. In
the first two cases variable types are conserved; in the latter two
cases, values are imported as characters (at this stage it’s more stable
and less error-prone not to try and automatically guess the types). If
raw_data_file in the configuration file points towards a
folder, then all files within it are imported in one list (each file is
imported as one element of the list).
The code book is converted from a nested list to a table with only relevant variables kept. The variable types are set explicitly. The questionnaire version is read from the corresponding file name: if the latter ends with “(x)”, where “x” is a number, then the version is x+1, otherwise the version is 1. For example, “2. Beobachtungsfragebogen AGI.json” is version 1 of that questionnaire, while “2. Beobachtungsfragebogen AGI(4).json” is version 5.
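The version rule can be sketched as follows (a Python illustration; the pipeline itself is written in R):

```python
import re

def questionnaire_version(filename):
    """Version from a questionnaire file name: a trailing "(x)" before
    ".json" means version x + 1, otherwise version 1."""
    match = re.search(r"\((\d+)\)\.json$", filename)
    return int(match.group(1)) + 1 if match else 1

print(questionnaire_version("2. Beobachtungsfragebogen AGI.json"))     # 1
print(questionnaire_version("2. Beobachtungsfragebogen AGI(4).json"))  # 5
```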
A variable “question_single”, which identifies unique questions, is built to match the corresponding variable in answers. It is the concatenation of questionnaire_name, “_v”, questionnaire_version, “_f”, question_position, “_”, answer_position.
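The concatenation can be sketched as follows (Python for illustration; the identifier format matches the examples appearing later in this report, e.g. “Spontanmeldung_v1_f2_1”):

```python
def question_single(questionnaire_name, questionnaire_version,
                    question_position, answer_position):
    """Build the unique question identifier used to match code book and answers."""
    return (f"{questionnaire_name}_v{questionnaire_version}"
            f"_f{question_position}_{answer_position}")

print(question_single("Spontanmeldung", 1, 2, 1))  # Spontanmeldung_v1_f2_1
```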
The variable is_decimal is discarded because its interpretation is not obvious: it seems to be either missing or FALSE, the latter always for enumerable variables usually starting with “How many days …”. The variable answer_type_id is kept, but what it means is unclear as well: does it give the type of the answer, like integer, character, etc.? It is never missing and can be any integer between 1 and 8. Lastly, answer_level and answer_position always take the same value; both are kept pending a choice of the most significant one.
Variable names are replaced according to the config file: each time a variable listed under the original property appears (possibly, in selected data sets only), it is replaced by the name of the higher-level element. In our previous example, each time a variable is called Proband, it is replaced by participant_id.
When variables don’t appear in the configuration, a default standardization is applied (the function make_clean_names of the R package janitor): the resulting names are unique and consist only of the _ character, numbers, and letters.
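A simplified sketch of what such a default standardization does (Python for illustration; the actual pipeline uses janitor::make_clean_names in R, whose behavior differs in details):

```python
import re

def make_clean_names(names):
    """Lower-case, replace anything that is not a letter, digit or
    underscore with "_", and de-duplicate with numeric suffixes."""
    cleaned, seen = [], {}
    for name in names:
        clean = re.sub(r"[^0-9a-zA-Z_]+", "_", name.strip()).strip("_").lower()
        if clean in seen:
            seen[clean] += 1
            clean = f"{clean}_{seen[clean]}"
        else:
            seen[clean] = 1
        cleaned.append(clean)
    return cleaned

print(make_clean_names(["Pseudonym PIA", "Proben-ID", "Proben-ID"]))
# ['pseudonym_pia', 'proben_id', 'proben_id_2']
```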
At the end, it is made sure that no two variables that haven’t been listed under the same name end up with the same name attributed to them.
The cleaning step consists itself in two steps:
The values of the following variables are cleaned (see section Standardization of variable names for original names and data sets):

- initial_quantity: comma replaced with dot, ” g” removed, value “99,00 g” replaced with missing;
- remaining_quantity: comma replaced with dot and ” µl” removed;
- concentration: comma replaced with dot and ” xE” replaced with “e” (which is then interpreted as 10^6);
- consent_blood_sample_collection, consent_result_communication, consent_sample_collection, test_participant: “Ja” or “ja”, respectively “Nein” or “nein”, replaced with TRUE respectively FALSE; other values replaced with missing;
- collection_date, delivery_date, analysis_date, reporting_date, questionnaire_date, answer_date: the suffix “:00” is added if, when their digits are removed, they are equal to “.., :”; this allows for proper date-time conversion later;
- answer_is_mandatory: “t” respectively “n” replaced with TRUE respectively FALSE; other values replaced with missing;
- participant_id, participant_id_hub, sample_id, bakt_sample_id: tested and replaced with missing if they don’t fit the predefined formats (see details in section Format of ID variables).

Each variable is then converted to the type indicated in the config file. If no type is indicated, no conversion is applied and the variable stays as it is (character when importing from Excel or CSV).
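The quantity-cleaning rules above can be sketched as follows (a hypothetical Python helper for illustration; the pipeline is in R):

```python
def clean_quantity(value, unit):
    """Replace the decimal comma with a dot and strip a trailing unit,
    e.g. "12,34 g" with unit " g" becomes "12.34"."""
    if value is None:
        return None
    value = value.replace(",", ".")
    if value.endswith(unit):
        value = value[: -len(unit)]
    return value

print(clean_quantity("12,34 g", " g"))    # 12.34
print(clean_quantity("250,0 µl", " µl"))  # 250.0
```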
The following types are directly converted in R: “boolean” (called logical in R), “character”, “integer”, “float” (double precision in R).
The two other possible types, “date” and “date_time”, require more processing. We allow for half of the entries to be ill-formatted. The values are thus assumed to be either:

- for more than half of them, integers or strings containing only digits, which is the internal numeric representation of dates in R and other languages;
- if the original raw-data file was Excel (identified by extension .xls or .xlsx), for more than half of them, numeric or strings containing only digits and possibly one dot; we assume Excel for Windows, Excel 2016 for Mac, or Excel for Mac 2011 was used (another reference date is used by earlier versions for Mac);
- or mostly well-formatted strings; then we try to guess the format using the function parse_date_time of the R package lubridate.
Date encoding and decoding in Excel are discussed on Microsoft’s website and Stackoverflow.
If the format is “date”, the variable is converted to the date format, i.e., it only contains information on year, month, and day. Otherwise, if the option to convert from German time is set to true in the config file, the time zone is assumed to be Germany with daylight saving times as of 2022: we first read the values with the time zone set to UTC, and then remove one or two hours depending on daylight saving.
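The conversion from German local time to UTC can be sketched as follows (Python with zoneinfo for illustration; the pipeline does the equivalent in R):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python >= 3.9

def german_to_utc(naive_string):
    """Interpret a naive date-time string as German local time and convert to UTC."""
    naive = datetime.fromisoformat(naive_string)
    local = naive.replace(tzinfo=ZoneInfo("Europe/Berlin"))
    return local.astimezone(timezone.utc)

# Winter (CET, UTC+1): one hour is removed
print(german_to_utc("2022-01-15 12:00:00"))  # 2022-01-15 11:00:00+00:00
# Summer (CEST, UTC+2): two hours are removed
print(german_to_utc("2022-07-15 12:00:00"))  # 2022-07-15 10:00:00+00:00
```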
The following transformations are then applied to the data sets.
The variables sample_id and bakt_sample_id
are separated, the latter renamed sample_id, and appended
to the data set. This is relevant for the data set samples,
see the previews of it before and after processing in the section Previews before and after
processing. Note that although the direct link between
sample_id and bakt_sample_id is broken, no
information is actually lost, as they are still both linked through
participant_id. See section RNA tests on nasal swabs for
details and difficulties.
The entry of cpt_hub with HUB participant ID
participant_id_hub equal to “hzif0386” and no delivery date
is removed, as according to a colleague it was duplicated by error.
Test participants as identified in the data set consent are removed from all data sets.
Variables (currently ids and comment) and individual entries (currently a set of individual order numbers (Auftragsnr), see section Nasal swabs) flagged in the config file are removed from all data sets.
Samples considered irrelevant are removed from all data sets. This can be the case when:

- one sample is linked to two different participants while having status “nicht genommen” (not taken) for one of the two; apparently the same sample ID could be re-used for a participant following another one from whom, contrary to what had been foreseen, no sample was taken;
- after being set to lower case, a sample ID starts with either “rsist-” or “rfee-”; these correspond to other studies and shouldn’t be part of the ZIFCO data sets.
answers_backup is converted to the same format as answers (e.g., one question ID containing questionnaire name and question position instead of having them separated in two columns, and the same column names).
Valid answers only are retained from multiple answers in the filled
PIA questionnaires answers and answers_backup,
see section Multiple answers for
details.
Only the answers corresponding to the questionnaires “Spontanmeldung”, “Regionsfragebogen”, “Symptome Atemwege”, and “Spontanmeldung: Symptome Atemwege” are kept in answers_backup, as these are apparently the only ones missing from the normal export answers (this hasn’t been checked). They are then added to answers.
The variables answer_values and
answer_values_code are removed from answers as
they are in the code book. All answers for which we can’t find the
corresponding question and answer options in the code book are
removed.
Lookup tables matching participant pseudonyms and sample ID’s are generated, and one is defined as the one to be used. See section Matching participant and sample ID’s for details.
cpt_pia and cpt_hub are merged via sample
and participant ID’s to produce the data set cpt.
pbmc_1 and pbmc_2 are merged via sample and
participant ID’s to produce the data set pbmc.
A column containing the participant pseudonym is added, by matching samples via the lookup table, to all data sets where it is not already present. In case participant pseudonyms are present in the data set but missing for some entries, if a non-missing pseudonym is found after matching with a sample, the non-missing value replaces the missing one.
The processed data sets are further transformed to meet the NAKO’s requirements, meta-data are generated, and all are exported as CSV’s in the required formats. See the previews in the sections Data for NAKO and Meta-data for NAKO.
One specific further data-set operation is the merging of answers with pia_codebook: the information on each question and its answers is added to answers itself, because the NAKO wants data sets that all have one row per participant. (An alternative would be to print question text, answer options, etc., in the NAKO meta data, but this would be very cumbersome to use and possibly not even feasible, as the text in Excel cells might exceed the maximum allowed size.)
The requirements on how to contribute data to NAKO are described (in German) in the document “TFS-Info-12a”, see the accompanying file, also available online.
Briefly:
Meta-data, describing individual variables, have to be provided in one of two specific formats (Excel template or CSV). They consist of one table with titles and descriptions of data sets and variables; names, units, “scale level” (whether ordinal, nominal, …), and “options” (for variables that have a fixed set of possible values, what those are).
Data are provided as CSV with a few specific requirements.
Variable names have to start with “u” followed by the NAKO application number and “_”, and overall have 20 or fewer characters. Moreover, missing values are not directly written in the column for a given variable, but rather as a separate variable in a separate column. It has the same name as the original variable with the suffix “_m” added. Thus, if all variable names have to be at most 20 characters, those of the original variables should actually be 18 or fewer characters long.
To ensure that two different variables have different names after being shortened, a suffix consisting of “_” and a number is added to the first 16 characters. (This works if there are 9 or fewer different variables with identical names after shortening.)
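The shortening scheme can be sketched as follows (a hypothetical Python helper with invented variable names; the pipeline is in R):

```python
def shorten_names(names, max_len=18):
    """Shorten variable names to max_len characters (18, so that the "_m"
    missing-value companion still fits into NAKO's 20-character limit);
    names that clash after shortening get a "_<digit>" suffix
    on their first 16 characters."""
    truncated = [n[:max_len] for n in names]
    counts = {}
    result = []
    for t in truncated:
        if truncated.count(t) > 1:            # clash after shortening
            counts[t] = counts.get(t, 0) + 1
            result.append(t[:16] + "_" + str(counts[t]))
        else:
            result.append(t)
    return result

print(shorten_names(["questionnaire_date_of_completion",
                     "questionnaire_date_of_creation",
                     "sample_id"]))
# ['questionnaire_da_1', 'questionnaire_da_2', 'sample_id']
```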
Missing values should be encoded as a negative integer with the meaning of each possible value described in the column “Option”. The document suggests an encoding with four values, “-1”, “-2”, “-3”, “-4”; at the moment everything is “-1” = “Fehlende Rohdaten”. This seems to correspond to the vast majority of the missing values appearing in the exported data sets.
The following table shows how often original variable names appear within and across the raw data sets:
After visual inspection and discussions:
The PIA/NAKO participant ID’s participant_id have one
format. After converting to lower case: “l3pia[9 digits]”
The HUB participant ID’s participant_id_hub have two
formats. After converting to lower case: “hzif[3 digits]” and “hzif[4
digits]”
The sample ID’s sample_id have different formats for
different analyses. After converting to lower case:
Moreover, two further sets of samples were found that matched other studies and were removed from all data sets. After converting to lower case: starting with “rsist-” and starting with “rfee-”.
Example of ID’s: participant_id: L3pia506545651,
l3pia986165890, l3pia334237262, l3pia069361605, l3pia459107441,
l3pia797835460; participant_id_hub: HZIF1056, HZIF0680,
HZIF0603, HZIF533, HZIF1317, HZIF0739; sample_id:
340528686, zifco-1249102811, ZIFCO-1076497429, ZIFCO-1009011348,
222115221, ZIFCO-1036456062.
The following values were ill-formatted and replaced with missing
values: participant_id: l3pia0987654321;
participant_id_hub: ZIFCO_Proband_01, ZIFCO_Proband_02;
sample_id: 12081497051, A, A2, B, B2, C, C2.
The corresponding data entries are:
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
After processing, the following data sets have these percentages of entries that have missing sample ID: answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 0%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0.1%, ids_lookup_1: 0%, ids_lookup_2: 0.01%, ids_lookup: 0%, pbmc: 0%, cpt: 0.26%. Here are the corresponding entries:
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
[no preview of individual- or sample-based data in non-confidential report]
participant_id is the PIA ID which was provided by NAKO
and thus the primary participant ID. participant_id_hub was
given by labs upon sample analysis. Thus we don’t necessarily expect a
one-to-one matching of participant_id and
participant_id_hub. Samples should have always exactly one
associated participant_id and at least one
participant_id_hub.
The matching was already done in the data set
nasal_swabs_pcr. The cpt data set was
comprised of cpt_hub which had sample ID’s and information
on samples, and cpt_pia with the same sample ID’s and
participant ID’s: thus here as well the matching could be done
directly.
There is no data set containing direct pairs of participant ID’s and sample ID’s as found in PBMC and plasma. The only way to match those to participants is via the HUB participant ID’s, see below.
Sometimes a sample ID for a participant would be re-used for a second participant if, contrary to what had been foreseen, a sample was not taken for the first participant. Thus, entries with sample status missing or “nicht genommen” and corresponding to several different participants in a data set are removed. This was the case in the data set samples.
There are two data sets related to swabs: swabs and nasal_swabs_pcr. The first contains only information on samples together with HUB participant ID’s; the second doesn’t have HUB participant ID’s but has analysis results.
Overlap of sample ID’s:

- swabs samples in nasal_swabs_pcr: 29.01%
- nasal_swabs_pcr samples in swabs: 65.14%

Two approaches are applied to match participants and samples across data sets.
The first method consists in first directly recording the participant_id - sample_id pairs present across data sets. Then those found by joining participant_id - participant_id_hub and participant_id_hub - sample_id pairs are added. (This actually doesn’t come up, as there is no data set with participant_id - participant_id_hub pairs.) And lastly, pairs found over two joins: participant_id - sample_id with sample_id - participant_id_hub and then with sample_id - participant_id. (Even though all HUB-participant pairs come together with a sample ID, it could be that two samples are linked to the same HUB ID but only the first is linked to a participant, in which case we can link the second over the associated HUB ID, then the first sample, then its participant.) Pairs not found directly are considered to have been found via the HUB participant ID. This produces the lookup table ids_lookup_1.
The second method consists in building all unique triplets
participant_id - participant_id_hub -
sample_id present in the data sets. This produces the
lookup table ids_lookup_2.
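The second method can be sketched as follows (Python with invented mini data sets and ID values for illustration; the real pipeline is in R):

```python
def build_triplets(datasets):
    """Collect all unique (participant_id, participant_id_hub, sample_id)
    triplets present across data sets given as lists of row dictionaries."""
    triplets = set()
    for rows in datasets:
        for row in rows:
            triplets.add((row.get("participant_id"),
                          row.get("participant_id_hub"),
                          row.get("sample_id")))
    # drop fully empty triplets, sort with None treated as ""
    return sorted((t for t in triplets if any(t)),
                  key=lambda t: tuple(x or "" for x in t))

swabs_rows = [{"participant_id_hub": "hzif0001", "sample_id": "zifco-1000000001"}]
pcr_rows = [{"participant_id": "l3pia000000001", "sample_id": "zifco-1000000001"}]
print(build_triplets([swabs_rows, pcr_rows]))
```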
We find that the ID’s given by HUB participant_id_hub
don’t uniquely correspond to participants across data sets. Thus they
can’t be used to improve matching of sample ID’s sample_id
and participant ID’s participant_id.
Looking at sample-participant pairs obtained via HUB participant, we find the samples were already present, and the new participants are either different or missing:
[no preview of individual- or sample-based data in non-confidential report]
These are the same sample-participant pairs as those obtained from the sample-participant-HUB participant triplets that are not in the direct sample-participant matching observed in the data sets:
[no preview of individual- or sample-based data in non-confidential report]
In conclusion, without further information, it is safer not
to use the HUB participant ID to link (final) participant pseudonym and
sample ID. This means leaving the value
match_via_hub to false in the config file.
The final lookup table ids_lookup used to match
participant and sample is the first one ids_lookup_1
without the participant-sample pairs found only via HUB participant
ID’s.
After data set operations (filtering, merging):
Out of 13759 sample ID’s with at least one match in participant ID or HUB participant ID, 0.04% (6) have more than one corresponding participant ID and 0% (0) samples have no corresponding participant pseudonym.
Percentages of samples without matching participant in each data set: answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 0%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0%, ids_lookup_1: 0%, ids_lookup_2: 0%, ids_lookup: 0%, pbmc: 0%, cpt: 0%.
When a sample has multiple matching participant ID’s across data sets, a missing participant ID is attributed in data sets without participant ID’s. A special rule is applied to RNA tests on swabs, see below.
When a sample doesn’t have a matching participant ID, it is not included in the export for NAKO.
For each data set, percentage of sample ID’s that appear more than once (expected is once): answers: N.A., pia_codebook: N.A., examination: N.A., nasal_swabs_pcr: 100%, plasma: 0%, samples: 0%, consent: N.A., swabs: 0%, ids_lookup_1: 0.02%, ids_lookup_2: 0%, ids_lookup: 0.02%, pbmc: 0%, cpt: 0.04%. Some of those are due to missing sample ID’s, which then can appear many times, see above.
We expect samples to appear once, except for
nasal_swabs_pcr since samples are tested for up to 20
targets.
[no preview of individual- or sample-based data in non-confidential report]
These are at the same time the entries with missing sample ID’s (and most other variables missing as well) and physician values different from the rest. They were identified by their order_number (originally: Auftragsnr) and removed from the data set. Below are the corresponding raw data.
However, we still don’t find a consistent number of results for each sample in nasal_swabs_pcr; here is the distribution of multiplicities of sample ID’s: 1 sample(s) with multiplicity 0, 1 sample(s) with multiplicity 3, 70 sample(s) with multiplicity 10, 21 sample(s) with multiplicity 11, 2 sample(s) with multiplicity 12, 1 sample(s) with multiplicity 14, 266 sample(s) with multiplicity 15, 21 sample(s) with multiplicity 16, 54 sample(s) with multiplicity 17. Indeed, it can happen that not all samples are tested for all targets.
Swabs can be analyzed via UTM (ID’s of the form zifco-10[8 digits]) or RNA (zifco-11[8 digits]). There are ambiguities regarding the latter in the data set samples: the same RNA sample ID, originally called Bakt_Proben_ID, sometimes appears many times with different UTM ID’s (originally: Proben_ID) and participant ID’s (Proband). Before any processing, this happens for 6.19% of the entries of the raw samples data set. The variables Status and Bemerkungen might contain hints on how to interpret this fact, but no obvious ones. Below are the corresponding data:
[no preview of individual- or sample-based data in non-confidential report]
For lack of further information, and given that at the time of writing no results are available for RNA tests, the samples with ambiguous matching of RNA ID and participant ID are removed from all data sets (samples, swabs). This is done after separating UTM and RNA IDs in samples, so no information is lost for the former, and the link between the two is maintained via the participant ID.
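The ambiguity check and removal can be sketched as follows. Only the original column names Bakt_Proben_ID, Proben_ID and Proband are taken from the text; the ID values are invented and the real pipeline may proceed differently.

```r
# An RNA ID (Bakt_Proben_ID) is ambiguous when it appears with more than one
# UTM ID (Proben_ID) or more than one participant ID (Proband).
samples <- data.frame(
  Bakt_Proben_ID = c("zifco-1100000001", "zifco-1100000001", "zifco-1100000002"),
  Proben_ID      = c("zifco-1000000001", "zifco-1000000009", "zifco-1000000002"),
  Proband        = c("p01", "p02", "p03"),
  stringsAsFactors = FALSE
)

by_rna <- split(samples[c("Proben_ID", "Proband")], samples$Bakt_Proben_ID)
ambiguous_ids <- names(by_rna)[vapply(
  by_rna,
  function(d) length(unique(d$Proben_ID)) > 1 || length(unique(d$Proband)) > 1,
  logical(1)
)]

# Drop the ambiguous entries from the data set.
samples_clean <- samples[!samples$Bakt_Proben_ID %in% ambiguous_ids, ]
```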
After processing, the number of duplicated entries (an entry appearing twice is counted once) in each data set: no data set has duplicates.
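A minimal sketch of this duplicate check in base R; `data_sets` is an illustrative named list, not the pipeline's actual object.

```r
# Count distinct entries occurring more than once: an entry appearing twice
# (or more often) is counted once.
n_duplicated_entries <- function(df) {
  nrow(unique(df[duplicated(df), , drop = FALSE]))
}

data_sets <- list(
  samples = data.frame(id = c("a", "b", "b", "b")),
  swabs   = data.frame(id = c("x", "y"))
)
res <- sapply(data_sets, n_duplicated_entries)
res
```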
Both data sets answers and answers_backup
contain information on individual questions and possible answers. This
can be compared to the code book to check that code book and PIA export
indeed are compatible.
To compare the questionnaire information contained in
pia_codebook, in answers, and in
answers_backup, we can join them (after cleaning, before
data set operations such as filtering or merging) using the question
identifier question_single and keeping only the relevant
variables. We add three variables to investigate the agreement:
answer_values_cb_answers_agree: whether answer_values is the same in the PIA code book and in the PIA answers;
answer_values_code_cb_answers_agree: whether answer_values_code is the same in the PIA code book and in the PIA answers;
answer_values_code_cb_answers_backup_agree: whether answer_values_code is the same in the PIA code book and in the backup.
Since answer_values from answers_backup is formatted differently and would be cumbersome to parse, we don't systematically compare it to the other data sets. This is the result:
All questionnaires (including their version) with a mismatch are: “2. Beobachtungsfragebogen AGI_v1”, “2. Beobachtungsfragebogen AGI_v2”, “2. Beobachtungsfragebogen Vaginose_v1”, “Beobachtung Vaginose_v1”, “Corona-Impfungen_v1”, “Fragebogen Impfungen I_v1”, “Fragebogen Impfungen II_v1”, “Impfung Corona _v1”, “Impfung Corona _v3”, “Impfung Corona _v4”, “Lebensqualität_v1”, “Nutzerakzeptanz_v1”, “Spontanmeldung: Beobachtungsfragebogen Vaginose_v1”, “Spontanmeldung_v1”, “Spontanmeldung_v2”, “Symptome Vaginose_v1”.
There is only one true mismatch, where a question of the questionnaire “Spontanmeldung” seems to correspond to one questionnaire version in the code book and another in the PIA answers, presumably follow-up questions were added or removed. Pending further investigation, the corresponding questions (the second question and its follow-ups in the versions 1 and 2 of the “Spontanmeldung” questionnaire, i.e., identified as “Spontanmeldung_v1_f2_1”, “Spontanmeldung_v1_f2_2”, “Spontanmeldung_v2_f2_1”, “Spontanmeldung_v2_f2_2”, “Spontanmeldung_v2_f2_3”, “Spontanmeldung_v2_f2_4”) are removed.
All other mismatches are explained by the absence of the corresponding questionnaire from the code book. Those questionnaires are sometimes found in the backup and sometimes not. ("Nutzerakzeptanz" is there, but only in version 2.)
The chosen solution is to remove all answers to questions not found in the code book: they are useless if the corresponding question is unknown.
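The join and agreement check can be sketched as below with base R's merge. The variable answer_values_code_cb_answers_agree is one of the flags described above; the toy code-book contents are invented.

```r
# Join code book and PIA export on the question identifier, then flag
# whether answer_values_code agrees between the two sources.
pia_codebook <- data.frame(
  question_single    = c("q1", "q2"),
  answer_values_code = c("1;2", "1;2;3"),
  stringsAsFactors = FALSE
)
answers_meta <- data.frame(
  question_single    = c("q1", "q2", "q3"),
  answer_values_code = c("1;2", "1;2", "1"),
  stringsAsFactors = FALSE
)

cmp <- merge(pia_codebook, answers_meta, by = "question_single",
             suffixes = c("_cb", "_answers"), all = TRUE)
cmp$answer_values_code_cb_answers_agree <-
  cmp$answer_values_code_cb == cmp$answer_values_code_answers
cmp
```

An outer join (all = TRUE) keeps questions missing from either source; their agreement flag is NA, which is exactly the case of a questionnaire absent from the code book.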
To limit computation times, this section is based on a random
sample of 10% of answers.
Before data set operations (filtering, merging), of the 120991648
values stored in answers, 36.74% are missing.
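The subsampling and the missing-value share can be sketched as follows; the toy table and the seed are invented, and mean(is.na(...)) counts missing cells over all stored values, as in the figure above.

```r
# Draw a reproducible 10% subsample of answers and compute the share of
# missing values, both on the full table and on the subsample.
set.seed(1)
answers <- data.frame(
  participant_id = sprintf("p%03d", 1:100),
  answer         = c(rep(NA_character_, 60), rep("yes", 40)),
  stringsAsFactors = FALSE
)

answers_sample <- answers[sample(nrow(answers), size = 0.1 * nrow(answers)), ]

share_missing_full   <- mean(is.na(answers))         # over all stored values
share_missing_sample <- mean(is.na(answers_sample))  # estimate from the sample
share_missing_full
```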
In particular there are many entries with the same question_id and participant_id, with questionnaire_date months apart and answer_values_code = answer_values == "[]". Overall 58.41% of entries have empty codes and values. However some of these do have answer dates (rounding to 0% of the entries with empty codes and values) and, of those, some have answers (again rounding to 0%).
Overall, 58.41% of entries have an empty answer, answer date, answer codes and answer values at the same time; 98.15% of data have a missing answer and 95.79% a missing answer date.
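The emptiness check can be sketched as below; the column names follow the text, the toy values are invented, and "empty" is taken to mean NA or the literal "[]".

```r
# Flag entries whose answer, answer date, answer codes and answer values are
# all empty at the same time, then compute their share.
answers <- data.frame(
  answer             = c(NA, NA, "yes"),
  answer_date        = c(NA, "2022-11-03", "2022-11-04"),
  answer_values      = c("[]", "[]", "[\"yes\"]"),
  answer_values_code = c("[]", "[]", "[\"1\"]"),
  stringsAsFactors = FALSE
)

is_empty <- function(x) is.na(x) | x == "[]"
all_empty <- with(answers,
  is_empty(answer) & is_empty(answer_date) &
  is_empty(answer_values) & is_empty(answer_values_code))
mean(all_empty)  # share of fully empty entries
```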
No processing is done in that regard. In the future, it might make sense to split the data into one data set useful only for quality assessment and one useful for both quality assessment and scientific questions. Furthermore there is surely a more efficient way to store these data; the code book could perhaps help here.
To limit computation times, this section is based on a random
sample of 10% of answers.
Before data set operations (filtering, merging):
When present, the fourth and last number in question_id, of the form "_a*", indicates which time the question has been raised or answered by the same participant for the same questionnaire (participants have a certain amount of time to change an answer once). The answer with the highest number is then the latest and the one to keep.
However, in 4.14% of cases we don't find the expected combinations of either no "_a*", or "_a1" and "_a2" once each for each questionnaire date. (This is after removing test participants.) Most often this is one question appearing only once with either "_a1" or "_a2", but a few other cases appear as well (0.33% overall). Presumably the vast majority of those have one or two missing answer dates.
Processing:
Remove "_a*" from the question name, storing the * as a number (set to 0 if there is no "_a*" suffix), and keep the entry with the highest number. Stop here pending further discussions. It remains to check that, when defined, questions with _a2 always have later answers than the same question with _a1, and similarly for _a* versus no suffix.
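The processing steps above can be sketched as follows; the column names follow the text, while the question IDs and answers are invented toy data.

```r
# Extract the attempt number from question_id (0 when there is no "_a*"
# suffix), strip the suffix, and keep, per participant, question and
# questionnaire date, the entry with the highest attempt.
answers <- data.frame(
  participant_id     = "p001",
  questionnaire_date = "2022-11-03",
  question_id        = c("Spontanmeldung_v1_f2_1",
                         "Spontanmeldung_v1_f2_1_a1",
                         "Spontanmeldung_v1_f2_1_a2",
                         "Spontanmeldung_v1_f2_2"),
  answer             = c("initial", "corrected", "final", "single"),
  stringsAsFactors = FALSE
)

has_suffix <- grepl("_a[0-9]+$", answers$question_id)
answers$attempt <- 0L
answers$attempt[has_suffix] <-
  as.integer(sub(".*_a([0-9]+)$", "\\1", answers$question_id[has_suffix]))
answers$question <- sub("_a[0-9]+$", "", answers$question_id)

# Sort by attempt (descending, stable) and keep the first entry per key.
latest <- answers[order(-answers$attempt), ]
latest <- latest[!duplicated(latest[c("participant_id", "questionnaire_date",
                                      "question")]), ]
latest[, c("question", "attempt", "answer")]
```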
One could go further and keep only the most recent answer for a given participant, question and questionnaire date-time, where an entry with an answer date has precedence over an entry with a missing answer date. But this should be discussed and investigated more deeply first.
Parts of the code used for data processing were inspired by and/or checked against code previously written by Irina Jansen. Requirements and many details were contributed and explained by Stefanie Castell, Jana Heise, and Irina Jansen.
5.2 Comments
For safety, the different comment columns are currently removed during processing. They might contain information on the quality of the samples, and currently don't seem to contain sensitive information, but they are difficult to use in a systematic way. Closer inspection is needed.
Below are the entries with a comment variable and non-missing values:
cpt_pia
[no preview of individual- or sample-based data in non-confidential report]
nasal_swabs_pcr
[no preview of individual- or sample-based data in non-confidential report]
pbmc_2
[no preview of individual- or sample-based data in non-confidential report]
samples
[no preview of individual- or sample-based data in non-confidential report]
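The removal of comment columns described in this section can be sketched as below; matching on the pattern "comment" (case-insensitive) is an assumption, as is the toy data set.

```r
# Drop every column whose name matches "comment", across a data set.
drop_comment_cols <- function(df) {
  df[, !grepl("comment", names(df), ignore.case = TRUE), drop = FALSE]
}

samples <- data.frame(sample_id = "zifco-1000000001",
                      comment   = "haemolytic",
                      Comments  = "recheck",
                      stringsAsFactors = FALSE)
samples_clean <- drop_comment_cols(samples)
names(samples_clean)
```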